ABSTRACT


The main objective of this research is to develop an
algorithm for isolated-word recognition. This research is
focused on digital signal analysis rather than linguistic
analysis of speech. Feature extraction is carried out by
applying a Linear Predictive Coding (LPC) algorithm of
order 10. Continuous-word and speaker-independent
recognition will be considered in a future study, once this
isolated-word research is complete.

To implement and test the proposed algorithm, a
microcomputer-based data acquisition system has been designed
and constructed. The system passes the voice signal through
a 100 Hz to 3.8 kHz bandpass filter, digitizes it at a
sampling rate of 8 kHz, and stores the digitized data in a
64K x 10 dynamic random access memory (DRAM) buffer. A
squelch circuit, consisting mainly of comparators (74LS85),
detects the beginning of the spoken word. The end of the
word is detected by a software algorithm that compares
the speech energy with a precalculated threshold. A flag
signals the end of sampling, and the data is transferred from
the buffer to an IBM-PC, where it is segmented into frames,
each 30 milliseconds long, and the LPC coefficients are
calculated.

To examine the similarity between the reference and the 
training sets, two approaches are explored. The first is 
implementing traditional pattern recognition techniques where 
a dynamic time warping algorithm is applied to align the two 
sets and calculate the probability of matching by measuring 
the Euclidean distance between the two sets. The second is 
implementing a backpropagation artificial neural net model 
with three layers as the pattern classifier. The adaptation 
rule implemented in this network is the generalized least 
mean square (LMS) rule. 

The first approach has been completed. A vocabulary
of 50 words was selected and tested; the accuracy of the
algorithm was found to be around 85%. The second approach is
in progress at the present time. The topology of the
backpropagation model consists of three layers: input,
hidden, and output. The actual output of each node is
calculated by applying a sigmoid nonlinearity to the inner
product of the weights and the inputs; the weights are adapted
using the formula $W_{ij}^{new} = W_{ij}^{old} + u E_j X_i$,
where u is the gain factor, $E_j$ is the error, and $X_i$ is
the input. The network is being simulated on a PC.

INTRODUCTION

For more than a decade the United States Government,
foreign countries (especially Japan), private corporations, and
universities have been engaged in extensive research on
human-machine interaction by voice. The benefits of this
interaction are especially noteworthy when the
individual is engaged in a hands-busy or eyes-busy task, is
working in low light or darkness, or when tactile contact is
impractical or impossible. These benefits make voice control a
very effective tool for space-related tasks. Some of the
voice control applications that have been studied at NASA-JSC
are: VCS flight experiments, payload bay cameras, EVA heads-up
display, mission control center display units, and a
voice-commanded robot. Voice control is of special benefit in
zero-gravity conditions, where voice is a very suitable tool
for controlling space vehicle equipment.

Automatic speech recognition is carried out mostly by 
extracting features from the speech signal and storing them 
in reference templates in the computer. These features carry 
the signature of the speech signal. These reference 
templates contain the features of a phoneme, word, or a 
sentence, depending on the structure of the recognizer. When a
voice interaction with the computer takes place, the computer
extracts features from the voice signal and compares them with
the reference templates. If a match is found, the computer
executes a programmable task such as moving the camera up or
down.


Several digital signal processing algorithms are available
for speech feature extraction. The efficiency of the
current algorithms is limited by hardware restrictions,
execution time, and ease of use. Some of these algorithms
are Linear Predictive Coding (LPC), short-time Fourier
analysis, and cepstrum analysis. Among these algorithms, LPC is
the most widely used since it is easy to use, has a short
execution time, and does not require large memory storage.
However, this algorithm has several limitations due to the
assumptions on which it is based.

Current speech recognition technology is not sufficiently 
advanced to achieve high performance on continuous spoken 
input with large vocabularies and/or arbitrary speakers. A 
major obstacle in achieving such high performance is the 
limited capability of the traditional pattern recognition 
(classifier) algorithms that are currently implemented. Because
of this limited capability, and considering that
humans have a remarkable ability to recognize
spoken words, researchers have started to explore the
possibility of implementing human-like models, known as
artificial neural networks, as the pattern
classifiers.

Neural net models have the greatest potential in areas
such as speech and image recognition, where many hypotheses
are pursued in parallel, high computation rates are required,
the ability to learn is desired, and the current best systems are
far from equaling human performance. Most neural net
algorithms adapt connection weights over time to improve
performance based on current results. Adaptation, or learning,
is a major focus of neural net research. The ability to adapt
and continue learning is essential in areas such as speech
recognition, where training data is limited and new talkers,
new words, new dialects, new phrases, and new environments
are continuously encountered.

LINEAR PREDICTIVE CODING (LPC) 


Feature extraction in our research is carried out 
through a Linear Predictive Coding algorithm. The signal 
corresponding to a spoken word is segmented into frames each 
30 milliseconds long and the algorithm is applied to replace 
each frame with 10 coefficients. In the following, we 
briefly review the LPC algorithm. Details of this algorithm 
can be found elsewhere [1-9]. This algorithm is built on the
fact that there is a high correlation between adjacent
samples of the speech signal in the time domain. This fact
means that the nth sample of the speech signal can be predicted
from previous samples. The correlation can be put in a
linear relationship as:

$$\hat{y}_n = a_1 y_{n-1} + a_2 y_{n-2} + \cdots + a_p y_{n-p} \qquad (1)$$


where p is the order of analysis. Usually p ranges from 8 to
12. $\hat{y}_n$ is the predicted value of speech at time n and the
$a_i$'s are the linear predictive coefficients. The prediction error
$E_n$ that results from the above linear relationship is:

$$E_n = y_n - \hat{y}_n = y_n - \sum_{i=1}^{p} a_i y_{n-i} \qquad (2)$$


To find the predictive coefficients that give the least mean
square error, the above equation is squared, partially
differentiated with respect to each of the $a_i$'s, and time-averaged
term by term. The result is p equations in p unknowns as shown
below:
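As a hedged sketch, assuming the standard autocorrelation formulation of LPC (the exact layout and numbering used by the authors is not certain), the p equations can be written in summation form and, equivalently, as a symmetric Toeplitz matrix system:

```latex
\sum_{i=1}^{p} a_i \, r_{|i-j|} = r_j , \qquad j = 1, 2, \ldots, p

\begin{bmatrix}
r_0     & r_1     & \cdots & r_{p-1} \\
r_1     & r_0     & \cdots & r_{p-2} \\
\vdots  & \vdots  & \ddots & \vdots  \\
r_{p-1} & r_{p-2} & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}
```
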
where $r_j$ is a correlation coefficient of the waveform $y_n$, and
$r_{-j} = r_j$ by the assumption of stationarity of $y_n$. The
coefficients $a_i$ exist only if the matrix in equation 3 is
positive definite. To ensure that this condition is
satisfied, $y_n$ is multiplied by a time window W. This
multiplication confines $y_n$ to a finite interval from 0 to
N-1, where N is the length of the window, so that a stable solution
for equation 3 is always obtained. Accordingly, $r_j$ can be
written as:

$$r_j = \sum_{n=0}^{N-j-1} y_n \, y_{n+j} \qquad (5)$$

In our study, a Hamming window is implemented.
Calculation of the correlation coefficients by window
multiplication is called the correlation method. The $a_i$'s
correspond to the resonance frequencies of the signal, and if
p, the order of the analysis, is selected correctly, these
$a_i$'s represent the formants, the frequencies at which peaks of
the power spectrum of the speech signal occur. A block
diagram representing an algorithm for voice recognition based
on LPC analysis is shown in Figure 1.
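
As a rough illustration of the steps described in this section, the following Python sketch cuts a signal into 30 ms frames, applies a Hamming window, computes the correlation coefficients, and solves the resulting p equations for the predictor coefficients. The Levinson-Durbin recursion used as the solver, the non-overlapping framing, and all function names are assumptions made for illustration; the text does not state how the equations were solved in this study.

```python
import math
import random

def split_frames(signal, fs=8000, frame_ms=30):
    """Cut the signal into frames (assumption: non-overlapping, tail dropped)."""
    n = fs * frame_ms // 1000          # 240 samples at 8 kHz
    return [signal[k:k + n] for k in range(0, len(signal) - n + 1, n)]

def hamming(n):
    """Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def correlation(frame, p):
    """r_j = sum_{n=0}^{N-j-1} y_n * y_{n+j} for j = 0..p."""
    n = len(frame)
    return [sum(frame[k] * frame[k + j] for k in range(n - j)) for j in range(p + 1)]

def levinson_durbin(r, p):
    """Solve sum_i a_i r_|i-j| = r_j (j = 1..p); returns (a_1..a_p, residual energy)."""
    if r[0] == 0.0:                    # silent frame: nothing to predict
        return [0.0] * p, 0.0
    a = [0.0] * (p + 1)                # a[0] is unused
    err = r[0]
    for m in range(1, p + 1):
        if err <= 0.0:                 # degenerate frame: stop early
            break
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        prev = a[:]
        a[m] = k
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        err *= 1.0 - k * k
    return a[1:], err

def lpc_frame(frame, p=10):
    """Window one frame and return its p predictor coefficients."""
    w = hamming(len(frame))
    windowed = [s * wk for s, wk in zip(frame, w)]
    coeffs, _ = levinson_durbin(correlation(windowed, p), p)
    return coeffs

if __name__ == "__main__":
    rng = random.Random(0)             # synthetic noise stands in for speech
    signal = [rng.gauss(0.0, 1.0) for _ in range(8000)]
    features = [lpc_frame(f) for f in split_frames(signal)]
    print(len(features), "frames,", len(features[0]), "coefficients each")
```

Each frame is thus replaced by a vector of 10 coefficients, and a word becomes a sequence of such vectors.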

FIGURE 1. CALCULATION OF THE LPC COEFFICIENTS [5]

EXPERIMENTAL DESIGN

To calculate the LPC coefficients and test the
performance of the proposed algorithm, a microprocessor-based
data acquisition system has been designed and constructed.
Figure 2 shows a block diagram of the system. The system is
designed to lay the foundation for further expansion and
enhancement toward more sophisticated microprocessor-based
speech identification and recognition research. The system
receives the voice signal through a microphone coupled with
an audio amplifier. The voice signal passes through an active
multiple-feedback bandpass filter with a 3 dB bandwidth of
approximately 3.5 kHz. The output signal of the filter is
applied to the analog-to-digital converter, where it is
digitized at a sampling rate of 8 kHz. A squelch circuit
detects the beginning of the utterance and
activates a temporary storage buffer to store the
digitized data. The buffer consists of DRAMs with a maximum
capacity of 64 Kbytes. A hardware flag (the output bit of a
flip-flop) signals the end of sampling, and the data is
transferred to the microprocessor, where digital signal
analysis and pattern recognition algorithms are applied. The
digital-to-analog converter, power amplifier, and loudspeaker
are used to verify the storage: if the output of this
circuit matches the original signal, then the storage is
successful. The system has been tested successfully with the
aid of a function generator. Details of the hardware of the
system can be found elsewhere [9].

FIGURE 2. A BLOCK DIAGRAM OF THE DATA ACQUISITION SYSTEM.

PATTERN RECOGNITION 

To recognize the spoken word, an algorithm is needed that
compares the reference and the training patterns to determine
whether or not they match. Two
approaches are discussed here. The first is a
traditional pattern classifier approach in which the training
pattern is aligned in time with the reference pattern and the
Euclidean distance between the two patterns is calculated and
taken as the probability of matching. The second is based on
implementing a backpropagation artificial neural network as
the pattern classifier. In the following, we briefly discuss
the two approaches.

A. Dynamic Time Warping (DTW) 

The dynamic time warping (DTW) algorithm finds the
"optimal" (least-cost) warping path w(n) that minimizes the
accumulated distance, D, between the training and reference
patterns, subject to a set of path and endpoint constraints.
The dynamic programming algorithm is based upon the fact that
the optimal path to point (i,j) in the two-dimensional matrix
illustrated in Figure 3 must pass through one of the points
(i-1,j), (i-1,j-1), or (i,j-1). The minimum accumulated
distance to point (i,j) is then given by:

$$D(i,j) = \mathrm{Dist}(i,j) + \min\{D(i-1,j),\; D(i-1,j-1),\; D(i,j-1)\}$$

where Dist(i,j) is the distance between frame i of the reference
pattern and frame j of the training pattern. The algorithm recursively
computes this distance column by column to determine the
minimum accumulated distance to the point (M,N), where M is
the number of frames in the reference and N is the number of
frames in the unknown (training) pattern. This path results in a time
alignment in which the reference word has the maximum
acoustic similarity with the input.

FIGURE 3. DYNAMIC TIME ALIGNMENT
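
The recursion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each pattern is a list of feature vectors (for example, one vector of LPC coefficients per frame), the only constraints imposed are that the path starts at the first frame pair and ends at the last, and the names `dtw_distance` and `classify` are hypothetical helpers, not part of the system described here.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(reference, unknown):
    """Minimum accumulated distance between two sequences of feature vectors."""
    M, N = len(reference), len(unknown)
    INF = float("inf")
    D = [[INF] * N for _ in range(M)]
    for j in range(N):                 # column by column over the unknown
        for i in range(M):             # frames of the reference
            d = euclidean(reference[i], unknown[j])
            if i == 0 and j == 0:
                D[i][j] = d            # endpoint constraint: start at (0, 0)
                continue
            best = min(
                D[i - 1][j] if i > 0 else INF,
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,
                D[i][j - 1] if j > 0 else INF,
            )
            D[i][j] = d + best
    return D[M - 1][N - 1]             # endpoint constraint: end at (M-1, N-1)

def classify(unknown, templates):
    """Return the vocabulary word whose template is closest to the unknown."""
    return min(templates, key=lambda word: dtw_distance(templates[word], unknown))

if __name__ == "__main__":
    templates = {
        "one": [[0.0], [1.0], [2.0]],
        "two": [[5.0], [5.0], [5.0]],
    }
    spoken = [[0.0], [0.0], [1.0], [2.0]]   # a slower version of "one"
    print(classify(spoken, templates))
```

The toy example uses one-dimensional features so that the alignment can be checked by hand; a repeated first frame in the unknown costs nothing because it is warped onto the same reference frame.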

B. Artificial Neural Network [10-18] 

An artificial neural network is a non-algorithmic
information processing structure based on the architecture of
the biological nervous system. The structure is composed of
a massive number of processing elements operating in
parallel in a predetermined manner. The processing elements
are connected by links with variable weights. The topology of
each network determines the way each processor is connected
to the others. A link can be excitatory, inhibitory, or have
no effect on the activity of the processing element (see
Figure 4). The primary processing at each element consists of
the calculation of weighted sums of the form $f(W_{ij}, X_i)$ and
weight changes of the form $W_{ij}^{new} = g(W_{ij}^{old}, X_i, X_j, \ldots)$. The
function f is usually a nonlinear function, $W_{ij}$ is the weight
of the link from element i to element j, and $X_i$ is the input
to element i (see Figure 5).

FIGURE 4. (panel labels: Type I, Type R)

Figure 6 shows some of the most
popular neural net models that can be used as classifiers; the classical
algorithms that are most similar to the neural net models are
listed along the bottom of the figure [10]. As shown in this figure, these
nets are first divided according to whether the input is
binary or continuous-valued. Second, they are divided
according to whether or not they need supervision during
training. Since a multi-layer perceptron backpropagation model is
implemented in our study, we discuss in the following section
the topology and the learning rule of that model.

FIGURE 6. CLASSIFICATION OF ARTIFICIAL NEURAL NETWORKS [10]

The Backpropagation Model [10,14,15] 

Figure 7 shows the topology of the backpropagation
model. The model consists of 3 layers (input, hidden, and
output). Each node in the input layer is connected to every
node in the hidden layer, and each node in the hidden layer
is connected to every node in the output layer. The input of
the model is a continuous-valued vector $x_0, x_1, \ldots, x_{N-1}$
representing either the LPC coefficients or the frequencies
of the formants, which are calculated from the coefficients.
Investigations, mainly experimental, will be carried out to
see which input is most suitable for word recognition. The
actual output $y_0, y_1, \ldots, y_{M-1}$ is calculated as:

$$y_j = f\left(\sum_i w_{ij} x_i - \theta\right)$$

where $\theta$ is a predetermined threshold
and the function f is the sigmoid nonlinearity:

$$f(a) = \frac{1}{1 + e^{-(a-\theta)}}$$

FIGURE 7. BACKPROPAGATION NETWORK.
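
To make the output calculation concrete, here is a minimal Python sketch of one fully connected layer. It assumes that `weights[i][j]` holds the weight from input i to node j and that the threshold is subtracted once from the weighted sum before the sigmoid is applied; both the data layout and the function names are illustrative assumptions.

```python
import math

def sigmoid(a):
    """Sigmoid nonlinearity 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def layer_output(weights, inputs, theta=0.0):
    """Output of every node j: sigmoid(sum_i w_ij * x_i - theta)."""
    n_nodes = len(weights[0])
    outputs = []
    for j in range(n_nodes):
        total = sum(weights[i][j] * inputs[i] for i in range(len(inputs)))
        outputs.append(sigmoid(total - theta))
    return outputs

if __name__ == "__main__":
    # Two inputs feeding three nodes; the values are arbitrary.
    w = [[0.5, -0.2, 0.1],
         [0.3, 0.8, -0.6]]
    print(layer_output(w, [1.0, 0.5]))
```

Applying `layer_output` twice, first with the input-to-hidden weights and then with the hidden-to-output weights, gives the output of the three-layer model.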


Training of the network is carried out by specifying, for
each training input, the desired output vector $d_0, d_1, \ldots, d_{M-1}$.
All elements of the desired output vector are
set to zero except for that corresponding to the current
input training word, which is set to 1. The weights are
adjusted recursively from the output nodes back to the hidden
nodes by the formula:

$$w_{ij}^{new} = w_{ij}^{old} + u \, \delta_j \, x'_i$$

where u is the gain factor, $\delta_j$ is the error, and $x'_i$ is
either the output of node i or an input. If node j is an
output node, then

$$\delta_j = y_j (1 - y_j)(d_j - y_j),$$

where $d_j$ is the desired output of node j and $y_j$ is the
actual output. If node j is an internal hidden node, then

$$\delta_j = x'_j (1 - x'_j) \sum_k \delta_k w_{jk},$$

where k is over all nodes in the layers above node j.
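
The update rule can be sketched for a three-layer network as follows. This is an illustrative sketch under stated assumptions, not the simulation used in this study: thresholds are taken as zero, the gain factor u = 0.5 and the layer sizes are arbitrary, and the function names are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def init_weights(n_from, n_to, rng):
    """Small random weights; w[i][j] links node i to node j in the next layer."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_to)] for _ in range(n_from)]

def forward(x, w_ih, w_ho):
    """Return hidden outputs h and network outputs y (thresholds assumed zero)."""
    h = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_ih[0]))]
    y = [sigmoid(sum(w_ho[j][k] * h[j] for j in range(len(h))))
         for k in range(len(w_ho[0]))]
    return h, y

def train_step(x, d, w_ih, w_ho, u=0.5):
    """One weight update; returns the squared error measured before the update."""
    h, y = forward(x, w_ih, w_ho)
    # Output nodes: delta_k = y_k (1 - y_k) (d_k - y_k)
    delta_out = [y[k] * (1.0 - y[k]) * (d[k] - y[k]) for k in range(len(y))]
    # Hidden nodes: delta_j = h_j (1 - h_j) * sum_k delta_k w_jk (old weights)
    delta_hid = [h[j] * (1.0 - h[j]) *
                 sum(delta_out[k] * w_ho[j][k] for k in range(len(y)))
                 for j in range(len(h))]
    # w_new = w_old + u * delta_j * x'_i, output layer first, then hidden layer
    for j in range(len(h)):
        for k in range(len(y)):
            w_ho[j][k] += u * delta_out[k] * h[j]
    for i in range(len(x)):
        for j in range(len(h)):
            w_ih[i][j] += u * delta_hid[j] * x[i]
    return sum((d[k] - y[k]) ** 2 for k in range(len(y)))

if __name__ == "__main__":
    rng = random.Random(0)
    w_ih = init_weights(2, 3, rng)     # 2 inputs, 3 hidden nodes
    w_ho = init_weights(3, 2, rng)     # 3 hidden nodes, 2 output nodes
    x, d = [0.2, 0.9], [1.0, 0.0]      # desired output: 1 for the current word
    errors = [train_step(x, d, w_ih, w_ho) for _ in range(200)]
    print(errors[0], errors[-1])
```

In a word recognizer the input vector would hold the features of the spoken word and the output layer would have one node per vocabulary word, so the largest output indicates the recognized word.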